Support Unicode international characters #151

WesTyler · 2017-02-07T20:24:43Z

- Punycode domains
- Utilize .codePointAt instead of .charCodeAt
- Tighten restricted codes to control characters instead of non-latin ranges

RFC references:

Internationalization of email addresses: RFC 6530
- Both local and domain portions should support all Unicode characters with the exception of C0 and C1 control characters
- Unicode characters in domains should be downgraded (via punycode) when interacting with ASCII-only systems, as when resolving MX records in the case of isemail
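
The downgrade step described in the second bullet could look roughly like the sketch below. It assumes Node's built-in `url.domainToASCII` and `url.domainToUnicode` as stand-ins for the `punycode` package this PR depends on, and `toAsciiDomain` is a hypothetical helper name, not something from isemail.

```javascript
'use strict';

// Assumption: Node's built-in url helpers stand in for the punycode
// package that this PR adds as a dependency.
const { domainToASCII, domainToUnicode } = require('url');

// Hypothetical helper: split an address at the last "@" and downgrade
// only the domain, leaving the local part untouched.
const toAsciiDomain = function (email) {

    const at = email.lastIndexOf('@');
    const local = email.slice(0, at);
    const domain = email.slice(at + 1);
    return { local, domain: domainToASCII(domain) };
};

const parts = toAsciiDomain('伊昭傑@郵件.商務');
console.log(parts.domain);                   // ASCII-only form, usable for an MX lookup
console.log(domainToUnicode(parts.domain));  // converted back for display
```

Only the domain is converted here; the local part stays as Unicode, since punycode applies to domain labels.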

Addresses #17

WesTyler · 2017-02-07T20:25:41Z

@skeggse this is going to fail linting because of the let Dns issue with rewire.

Please let me know if I missed anything, or if this is not what you had in mind. I'm happy to push updates to this PR.

WesTyler · 2017-02-07T20:26:42Z

Actually, it occurs to me that I can escape that line from the linter. I'm going to push a commit for that.

skeggse

Would you please add a few tests that verify the correct byte length checking behavior for long unicode email addresses - iirc the length limits on email addresses are in numbers of octets, as captured here and here.

Can you comment on whether this line is still correct, given the check against 127? Same for this line, this one, and this one.

Please bump the version to at least 2.3.0. We might consider either going to 3.0.0, or adding an allowUnicode flag to selectively enable this (which would be more complicated). It's possible others depend on this module rejecting unicode email addresses, and it's also possible that they didn't consider it and will start having issues if they accept a unicode email address but can't handle it in their delivery mechanism.

EDIT: sorry for making these concerns kinda late in the process - they did not occur to me sooner.

skeggse · 2017-02-07T21:09:02Z

test/tmp.js

+            checkDNS: true
+        }, () => {
+
+            expect(internals.punyDomain).to.equal('xn--5nqv22n.xn--lhr59c');


This works, but depends on the exact behavior of punycode. I'd rather not test whether punycode did the right thing, just whether validate called punycode correctly. Instead, could you put this?

expect(punycode.toUnicode(internals.punyDomain)).to.equal('伊昭傑@郵件.商務');

WesTyler · 2017-02-07T21:25:12Z

Thanks for the quick review. Addressing these issues now. 👍

skeggse · 2017-02-07T21:30:36Z

Ah, I also just noticed RFC 6532, which, in Section 3.1 discusses the appropriate normalization form for the email address. It looks like String#normalize implements the appropriate conversion - do you think it would be appropriate to preprocess the email address by calling normalize on it?
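
As a rough illustration of what calling normalize would do, here is a minimal sketch. It assumes NFC is the intended form (the default for `String.prototype.normalize`) and uses a made-up domain as sample input.

```javascript
'use strict';

// "n" followed by U+0303 COMBINING TILDE (decomposed form)
const decomposed = 'man\u0303ana.com';
// U+00F1 LATIN SMALL LETTER N WITH TILDE (precomposed form)
const composed = 'ma\u00f1ana.com';

console.log(decomposed === composed);                    // false
console.log(decomposed.normalize('NFC') === composed);   // true

// Normalization can change the string length, which matters for length checks.
console.log(decomposed.length, decomposed.normalize('NFC').length);  // 11 10
```

Both strings render identically, which is why comparing or validating them without normalization can give surprising results.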

WesTyler · 2017-02-07T21:35:22Z

I missed RFC 6532... Reading through it now.

WesTyler · 2017-02-07T21:51:43Z

Thanks @robhorrigan :D

@skeggse re: normalization - I don't know that it is necessary, and it may cause some unintended side effects. I'm looking at Section 4 of RFC 6532. Specifically the fact that normalization can cause the length of the address to exceed requirements/overflow buffers in edge cases.

Marsup

Commented on the how, not the what, as you know more than me after reading that RFC :)

@skeggse you should keep versioning under your control and not part of this PR.

Marsup · 2017-02-07T22:22:05Z

lib/index.js

+
+    // add C0 control characters
+
+    for (let i = 0; i < 33; ++i) {


Since this package only supports node 4+, you can probably use lookup.fill to initialize it.
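
A minimal sketch of that suggestion, assuming the table is a plain array of booleans indexed by character code. The table size and the second range are illustrative guesses, not taken from lib/index.js.

```javascript
'use strict';

// Assumption: a 256-entry table where `true` marks a code to reject.
const lookup = new Array(0x100).fill(false);

// Same bound as the `i < 33` loop in the diff: codes 0-32.
lookup.fill(true, 0, 33);

// Hypothetical second range: DEL and the C1 controls (127-159).
lookup.fill(true, 127, 160);

console.log(lookup[32], lookup[33]);    // true false
console.log(lookup[159], lookup[160]);  // true false
```

`Array.prototype.fill(value, start, end)` treats `end` as exclusive, so each call mirrors a `for (let i = start; i < end; ++i)` loop.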

Marsup · 2017-02-07T22:23:14Z

package.json

@@ -18,10 +18,12 @@
    "node": ">=4.0.0"
  },
  "dependencies": {
+    "punycode": "^2.1.0"


Keep hapi versioning scheme (2.1.x).

Marsup · 2017-02-07T22:25:10Z

test/tmp.js

@@ -0,0 +1,47 @@
+'use strict';


tmp.js ??

@skeggse's preference is to utilize a proper internationalized domain with an active MX record. I have struck out on finding such a domain, so this approach was his backup plan. I'm guessing tmp.js was so that it doesn't become the permanent testing solution by default. :P

Marsup · 2017-02-07T22:31:23Z

lib/index.js

@@ -2,8 +2,8 @@

 // Load modules

-const Dns = require('dns');
-
+let Dns = require('dns'); // eslint-disable-line prefer-const


I think proxyquire would allow you to keep that const.

@skeggse - Do you prefer rewire or is it worth it to utilize proxyquire to prevent the ESLint override?

WesTyler · 2017-02-07T23:06:06Z

@robhorrigan and I have started on addressing the changes you requested @skeggse. Planning on having the additional tests and refactors up tomorrow AM. We're close :)

skeggse · 2017-02-07T23:28:16Z

@WesTyler after rereading Section 4 of RFC 6532, I'm not sure I agree with your comment. I read that section as saying that normalization should mitigate some of the security issues.

I'm also concerned about introducing this change without also providing a warning to users reminding them to normalize Unicode emails. Thoughts?

Also thanks for the input, @Marsup. I'll update the version separately. @WesTyler is right about tmp.js, though I continue to be open to a better file name.

Per proxyquire, I'm not sure. I prefer it when we don't make changes to source specifically for testing.

Marsup · 2017-02-07T23:33:07Z

proxyquire acts on the require cache, it seems better than rewire for your usage.

skeggse · 2017-02-07T23:37:04Z

Ah, I think I misread the docs for proxyquire when I skimmed them. Yeah, that looks good.

WesTyler · 2017-02-08T01:01:59Z

@skeggse You're absolutely right about the normalization. I misread that section and missed the key part:

The normalization process described
in Section 3.1 is recommended to minimize these issues.

Here are my action items for this PR:

- Address conditional statements to ensure that the correct charCode is caught in the correct places.
- Update test case in tmp.js as requested.
- Pull out rewire and implement proxyquire instead. This will remove the linting exception.
- Add preprocessing normalization of emails.

WesTyler · 2017-02-08T16:52:50Z

@skeggse - need to call in the big guns on this.
I am verifying the charCode conditional here. My interpretation based on that range is that it is targeting the C1 controls (128-159), which does not include 127 (delete).
The test case that covers this block is "\"test\\©\"@iana.org". Was the \© intentionally chosen based on the RFC, or was it just a character with Unicode decimal value >127?

skeggse · 2017-02-08T17:07:04Z

This project was originally ported from the is_email php function. I pulled nearly all the test-cases from that project. My guess is that it was only chosen because it was outside the ASCII range.

WesTyler · 2017-02-08T19:00:31Z

@skeggse Ok, we are at the last change requested - normalization.

It looks like Node.js is already handling UTF-8 encoded unicode characters.
'\u0101' === 'ā' // true
Also, we tried to set up a test to assert that normalization is taking place as expected by calling Isemail.validate('test\u0101@\u0101.com'). The parameter at this point in the code already has the Unicode character in place instead of the UTF-8 string.
I have been unable to replicate an email address with UTF-8 encoding of Unicode characters for which calling email.normalize() had any effect.

Thoughts??

skeggse · 2017-02-08T21:14:36Z

JavaScript strings are UCS-2 encoded, not UTF-8, so you'll need to try a character that needs to be represented as a surrogate pair in UCS-2.
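
A quick sketch of the difference, using an emoji as an arbitrary example of a character outside the Basic Multilingual Plane:

```javascript
'use strict';

const emoji = '\ud83d\ude06';   // U+1F606, stored as a surrogate pair

console.log(emoji.length);                       // 2 (16-bit code units)
console.log(emoji.codePointAt(0).toString(16));  // 1f606
console.log(Array.from(emoji).length);           // 1 (code points)
console.log('\u0101'.length);                    // 1 (fits in a single code unit)
```

'\u0101' fits in one code unit, so it never exercises the surrogate-pair path.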

WesTyler · 2017-02-08T21:18:10Z

Oh geez. It's been a long day... XD

WesTyler · 2017-02-08T23:30:03Z

Ok, so, this is as close as @robhorrigan and I have been able to get for writing tests to cover normalization:

> punycode.toUnicode(punycode.toASCII('man\u0303ana.com')) === 'mañana.com'
false
> punycode.toUnicode(punycode.toASCII('man\u0303ana.com'.normalize())) === 'mañana.com'
true

So our thought is to pass in the email 'testing@man\u0303ana.com' in tmp.js and assert

expect(punycode.toUnicode(internals.punyDomain)).to.equal('mañana.com');

Does that meet your expectations for normalize?

skeggse · 2017-02-08T23:32:10Z

Yeah that'll do for now.

- adds UTF-16 surrogate pair characters to tests.json
- adds exported `.normalize` method and associated test
- adds test in `tmp.js` for punycoded normalized characters

WesTyler · 2017-02-09T16:37:39Z

@skeggse - it looks like the behavior for String.prototype.normalize changes in the V8 versions used between Node v4.x and Node v6.x and it is breaking existing tests containing NULL characters \u0000.

https://github.com/v8/v8/blob/master/ChangeLog#L728

Node v6.x (all existing tests pass):

> '\"test\u0000\"@iana.org'.normalize()
'"test\u0000"@iana.org'

Node v4.x (normalize breaks existing tests with NULL):

> '\"test\u0000\"@iana.org'.normalize()
'"test'

[EDIT] Working on a nulNormalize helper to effectively backport the V8 fix to Node v4.x
Here's the V8 commit that corrected the normalize functionality
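
One way such a workaround might be written is sketched below. This is an assumption about the approach, not the helper that landed in the PR: it normalizes each NUL-delimited segment on its own so a truncating `normalize` never sees the NUL.

```javascript
'use strict';

// Hypothetical sketch: normalize around NUL characters instead of through them.
const nulNormalize = function (email) {

    return email
        .split('\u0000')
        .map((part) => part.normalize('NFC'))
        .join('\u0000');
};

console.log(nulNormalize('"test\u0000"@iana.org') === '"test\u0000"@iana.org');  // true
console.log(nulNormalize('man\u0303\u0000ana') === 'ma\u00f1\u0000ana');         // true
```

On runtimes where `normalize` already handles NUL correctly, this should give the same result as calling `normalize` directly.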

skeggse · 2017-02-10T18:15:12Z

Ok, awesome! (technically there's no monkey-patching in that last commit, but it is certainly a workaround)

I'll review the whole thing one more time this evening.

WesTyler · 2017-02-10T19:07:50Z

Ha my bad, I guess you're right. I was thinking monkey-patch since it was checking process.version, but that doesn't mean it's modifying the existing .normalize functionality at runtime. :P

Thanks for the patience and help with this throughout the week

lamchakchan · 2017-02-10T19:37:51Z

lib/index.js

@@ -1413,5 +1426,12 @@ exports.diagnoses = internals.validate.diagnoses = (function () {

 exports.normalize = internals.normalize = function (email) {

+    // $lab:coverage:off$
+    if (process.version[1] === '4' && email.match(/\0/g)) {


I would recommend replacing regex with a simple indexOf for performance reasons.

if (process.version[1] === '4' && email.indexOf('\u0000') === 0) {

On it. Just haven't pushed up yet :)
Also, using email.indexOf('\u0000') >= 0 because the NUL is not necessarily at the beginning

skeggse

The loop that iterates over the characters in the string won't handle surrogate pairs. Is this appropriate? It seems like it would be better to consume an entire unicode code point for each iteration of that loop. Furthermore, on the next line, the token = email[i]; won't handle surrogate pairs either. Maybe token = email.codePointAt(i) and i += token.length or something. Please double-check the use of i as an iterator, including its uses inside the loop.

I added some length checks that fail, as they don't consider the octet count correctly.

WesTyler · 2017-02-11T16:53:26Z

Ok, I think I found a way to check for surrogate pairs. I've modified the iterator to grab the entire pair as the token and then "skip" the second code unit. I'm pushing up that commit so you can check the modification, but the tests are still failing.

My question now is about the test cases and the asserted results. For the surrogate pair \ud83d\ude06, that should be considered a single character when calculating the email length, right? If so, then I believe the test cases you added may be asserting incorrectly. It should take 65 of the \ud83d\ude06 pairs to trigger the "rfc5322LocalTooLong" error, not 17... I have similar concerns about the other 4 tests cases if I'm right about this one.

Also, if that is the case, I need to update the way the parseData.local.length and parseData.domain.length lengths are being checked to account for '\ud83d\ude06'.length === 2 instead of the desired '\ud83d\ude06'.length === 1

skeggse · 2017-02-11T17:27:11Z

It's the exact opposite, actually. The original rules about limits on the number of octets for, say, the entire local part still apply. As a result, a should count as one octet, but \ud83d\ude06 should count as four octets.

I just realized the tests for the domain part might be wrong, though - the punycode version is what would actually be sent over the wire.

WesTyler · 2017-02-11T17:28:27Z

\ud83d\ude06 should count as four octets

Wait, what?? What in the world am I reading horribly wrong?? Haha

skeggse · 2017-02-11T17:30:37Z

Buffer.byteLength('\ud83d\ude06', 'utf8') === 4

WesTyler · 2017-02-11T17:38:56Z

Ohhhhhhh. I see now. Got it, thank you.

I updated the iterator, but I still do need to account for the difference here:

> '\ud83d\ude06'.length
2
> Buffer.byteLength('\ud83d\ude06', 'utf8')
4
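
A minimal sketch of an octet-based length check for the local part, assuming the 64-octet limit from RFC 5321. The constant and function names are hypothetical, not taken from lib/index.js.

```javascript
'use strict';

const MAX_LOCAL_OCTETS = 64;   // RFC 5321 limit on the local part, in octets

// Hypothetical helper: measure the UTF-8 encoding, not the string length.
const localTooLong = (local) => Buffer.byteLength(local, 'utf8') > MAX_LOCAL_OCTETS;

const emoji = '\ud83d\ude06';  // 4 octets in UTF-8

console.log(localTooLong(emoji.repeat(16)));  // false: 64 octets
console.log(localTooLong(emoji.repeat(17)));  // true: 68 octets
console.log(localTooLong('a'.repeat(64)));    // false: 64 octets
```

This is why 17 pairs, not 65, are enough to exceed the limit.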

skeggse · 2017-02-11T17:47:05Z

Try

for (let i = 0; i < emailLength; i += token.length) {
  token = email.codePointAt(i);

WesTyler · 2017-02-11T18:12:53Z

Is there something wrong or undesirable with the iterator update in this commit?

WesTyler · 2017-02-11T18:15:27Z

If not, then I think my latest update (checking Buffer.byteLength instead of string.length) may be the final piece.
I had to update a couple of my international character tests because

> Buffer.byteLength('郵', 'utf8')
3

skeggse · 2017-02-11T18:22:41Z

The method you used to pull surrogate pairs out of the string functions. That said, it doesn't add anything - it makes the code harder to read, and I'd prefer readable and concise code.

It's also misleading: within the loop, we expect i to reference the start of the current character being processed. That said, the way we use it is inconsistent with surrogate pairs.

There are a number of lines that look like:

if (emailLength === ++i || email[i] !== '\n') {

This will break in my proposed iterator. I'd still prefer to move in that direction specifically because it makes i more consistent with my expectations. I can make these changes to your branch, if you want.

WesTyler · 2017-02-11T18:32:30Z

Alright, that's fair.
token = email.codePointAt(i); doesn't work though. The rest of the logic requires that token is the actual substring of the email (like '\ud83d\ude06').

I'll see what I can do to clean it up though.

skeggse · 2017-02-11T18:35:24Z

Ah, crud. You're right. token = String.fromCodePoint(email.codePointAt(i));

WesTyler · 2017-02-11T18:40:17Z

I can make these changes to your branch, if you want.

I would be on board with that if you have the time. Gotta try to get some time in with my wife and baby daughter this weekend :)

WesTyler · 2017-02-15T18:35:03Z

@skeggse I had the chance to replace my internals.checkSurrogatePair with the

for (let i = 0; i < emailLength; i += token.length) {
  token = String.fromCodePoint(email.codePointAt(i));

I didn't end up needing to change any of the iterator-related code, only the logic in quote pairs that prepends '\\' to the token.
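
For reference, the iteration pattern in isolation might look like the following sketch. The `tokens` function is a hypothetical wrapper that only collects the tokens; the real loop runs the validation logic on each one.

```javascript
'use strict';

// Hypothetical wrapper: walk the string one code point at a time.
const tokens = function (email) {

    const result = [];
    let token;

    for (let i = 0; i < email.length; i += token.length) {
        // token is a substring of length 1 or 2 (2 for a surrogate pair),
        // so i always lands on the start of the next character.
        token = String.fromCodePoint(email.codePointAt(i));
        result.push(token);
    }

    return result;
};

console.log(tokens('a\ud83d\ude06b').length);  // 3
```

Because `token` is a string rather than a number, `token.length` advances `i` past both halves of a surrogate pair.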

Is this closer to what you were hoping to see?

skeggse · 2017-02-15T22:27:03Z

Wonderful! I'll get these out on 2.3.0. Sorry for last weekend - I meant to take care of this but too many other things piled up.

WesTyler · 2017-02-15T22:28:58Z

Hey, no worries! I ended up having some bandwidth today and it wasn't as involved as I was worried it would be.

Thanks again for all of the patience and help with this :D

skeggse · 2018-10-11T01:47:08Z

lib/index.js

-                            // Check if it's a neither a number nor a latin letter
-                        else if (charCode < 48 || charCode > 122 || (charCode > 57 && charCode < 65) || (charCode > 90 && charCode < 97)) {
+                            // Check if it's a neither a number nor a latin/unicode letter
+                        else if (charCode < 48 || (charCode > 122 && charCode < 192) || (charCode > 57 && charCode < 65) || (charCode > 90 && charCode < 97)) {


@WesTyler I'm reading through some of this code as I refactor, and now I find myself asking where 192 came from. You wouldn't happen to recall, would you?

Yeah, I believe 192 is the beginning of the Unicode "international" character set (À).

If I remember correctly, Unicode #s 123-191 are all non-character symbols.
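
To see what the updated condition accepts, here it is as a standalone predicate. This is only a sketch: `failsLetterDigitCheck` is a hypothetical name, and the sample characters are arbitrary.

```javascript
'use strict';

// Same boolean expression as the "+" line in the diff above.
const failsLetterDigitCheck = function (charCode) {

    return charCode < 48 ||
        (charCode > 122 && charCode < 192) ||
        (charCode > 57 && charCode < 65) ||
        (charCode > 90 && charCode < 97);
};

const check = (ch) => failsLetterDigitCheck(ch.codePointAt(0));

console.log(check('z'), check('0'));   // false false
console.log(check('{'), check('©'));   // true true   (123 and 169)
console.log(check('À'), check('郵'));  // false false (192 and above)
console.log(check('×'));               // false       (215 is a symbol but is 192 or above)
```

The cutoff is approximate: a few codes at or above 192, such as × (215) and ÷ (247), are symbols rather than letters and still pass this check.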

Ahh, ok. Thanks!

Support Unicode international characters

b4566ba

- Punycode domains - utilize .codePointAt instead of .charCodeAt - tighten restricted codes to control characters instead of non-latin ranges

Disable linting for rewire "let" workaround

7c7901f

skeggse suggested changes Feb 7, 2017

Add inline comment for eslint ignore on prefer-const rule

0ab6884

Marsup reviewed Feb 7, 2017

WesTyler added 5 commits February 8, 2017 14:50

Expand test coverage of Unicode cases

3487199

Utilize array.fill; do not pin dependency at the patch level

fbd9f1c

Specify C0 and C1 charCode ranges to allow unicode

86c3194

Replace rewire with proxyquire

79e2261

Remove "let" and eslint override introduced by rewire useage

24f78ef

Add normalization to email addresses

996a71e

- adds UTF-16 surrogate pair characters to tests.json - adds exported `.normalize` method and associated test - adds test in `tmp.js` for punycoded normalized characters

lamchakchan reviewed Feb 10, 2017

WesTyler and others added 2 commits February 10, 2017 13:42

Utilize indexOf instead of regex match for speed

10bb502

Add unicode length tests

38039b8

skeggse suggested changes Feb 11, 2017

Check for surrogate pairs in token iteration

599a245

Check Buffer.byteLength instead of string.length

1fdcf9e

Deprecate internals.checkSurrogatePair

1d20f07

skeggse approved these changes Feb 15, 2017

skeggse merged commit 0f795c5 into skeggse:master Feb 15, 2017

skeggse pushed a commit that referenced this pull request Jun 22, 2017

Remove proxyquire

f2e8a78

This was left over from some tests introduced in #151 (0f795c5) and removed in #153 (ac9acbd).

skeggse reviewed Oct 11, 2018

skeggse mentioned this pull request Oct 21, 2018

Validate local part unicode correctness #192

Closed

danschultzer mentioned this pull request Aug 20, 2019

E-mail validation doesn't accept addresses with unicode (RFC 6531) pow-auth/pow#253

Closed
